Consider the 3-dimensional data set generated by the following code.
library(plot3D)
library(princurve)
t <- seq(-1.5*pi,1.5*pi,l=100)
R<- 1
n<-75
sd.eps <- .15
set.seed(1)
y <- R*sign(t) - R*sign(t)*cos(t/R)
x <- -R*sin(t/R)
z <- (y/(2*R))^2
rt <- sort(runif(n)*3*pi - 1.5*pi)
eps <- rnorm(n)*sd.eps
ry <- R*sign(rt) - (R+eps)*sign(rt)*cos(rt/R)
rx <- -(R+eps)*sin(rt/R)
rz <- (ry/(2*R))^2 + runif(n,min=-2*sd.eps,max=2*sd.eps)
XYZ <- cbind(rx,ry,rz)
lines3D(x,y,z,colvar = NULL,
phi = 20, theta = 60, r =sqrt(3), d =3, scale=FALSE,
col=2,lwd=4,as=1,
xlim=range(rx),ylim=range(ry),zlim=range(rz))
points3D(rx,ry,rz,col=4,pch=19,cex=.6,add=TRUE)
When fitting principal curves to these data, use the function
princurve::principal_curve with the option
smoother="smooth_spline". This is the default, so you
do not need to set it explicitly. The tuning parameter of
smooth_spline is the degrees of freedom df (see
help(smooth.spline) if you want). For instance, the call
principal_curve(XYZ, df=6)
fits the required principal curve with degrees of freedom
df equal to 6.
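For reference, the returned object contains (among other elements; see help(principal_curve)) the projections s, the ordering ord of the points along the curve, and the total squared projection distance dist. A minimal sketch, assuming XYZ as generated above:

```r
# Inspect the main elements of a principal_curve fit
fit <- princurve::principal_curve(XYZ, df = 6)
dim(fit$s)     # n x 3 matrix: projection of each point onto the curve
head(fit$ord)  # order of the observations along the curve
fit$dist       # total squared distance from the points to the curve
```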
Choose the value of df by
leave-one-out cross-validation. Restrict the search of df
to seq(2,8,by=1). (Hint: The function
project_to_curve should be used. See the element
dist of the object it returns.)
loocv <- function(data, k){
  distance <- 0
  for (i in 1:dim(data)[1]){
    pred <- princurve::principal_curve(data[-i,], df=k)
    pred_proj <- princurve::project_to_curve(t(matrix(data[i,])), pred$s[pred$ord,])
    distance <- distance + pred_proj$dist
  }
  return(distance)
}
K=seq(2,8,by=1)
distances=c()
for(k in K){
distances=append(distances,loocv(XYZ,k))
}
min_k=K[which.min(distances)]
plot(K, distances)
abline(v=min_k,col=2)
Plot the fitted principal curve for the optimal df and comment on the obtained results. The principal curve with the optimal df=6 fits the data by following it in an S-like shape: it captures the overall trend of the data without chasing the individual points.
best_df <- min_k
best_pc_fit <- principal_curve(XYZ, df = best_df)
# Project your data onto the selected principal curve
projected_data <- project_to_curve(XYZ, best_pc_fit$s[best_pc_fit$ord, ])
# Assess the quality of the projection by examining the dist element
projection_distances <- projected_data$dist
cat("Best df:", best_df, "\n")
## Best df: 6
cat("Projection Distances:", projection_distances, "\n")
## Projection Distances: 5.729252
plot(best_pc_fit,xlim=range(XYZ[,1]),ylim=range(XYZ[,2]),asp=1)
points(XYZ,col=4)
# 3D PLOT OF THE CURVE
curve_coords = best_pc_fit$s[best_pc_fit$ord,]
lines3D(curve_coords[,1],curve_coords[,2],curve_coords[,3],colvar = NULL,
phi = 30, theta = 60, r =sqrt(3), d =3, scale=FALSE,
col=2,lwd=4,as=1,
xlim=range(XYZ[,1]),ylim=range(XYZ[,2]),zlim=range(XYZ[,3]))
points3D(XYZ[,1],XYZ[,2],XYZ[,3],col=4,pch=19,cex=.6,add=TRUE)
Compute the leave-one-out cross-validation error for df=50 and compare it with the result corresponding to the
optimal df value you found before. Based
only on the leave-one-out cross-validation error values, which value of
df do you think is better, the previous optimal one or
df=50? Then fit the principal curve with df=50 and plot the
fitted curve in the 3D scatterplot of the original points. Which df do you prefer? Looking at the plots, the overfitting for df=50 is clear. Nevertheless,
leave-one-out cross-validation has not been able to detect this fact.
Why do you think df=50 obtains such a good
leave-one-out cross-validation error?
distance <- loocv(XYZ, 50)
cat("LOOCV for df=50: ", distance, "\n")
## LOOCV for df=50: 2.954621
cat("LOOCV for optimal df=6:", min(distances))
## LOOCV for optimal df=6: 9.558809
What value of df do you think is better,
the previous optimal one or df=50? Without
plotting the principal curve, the cross-validation error for
df=50 seems far better than the one for the optimal df. It
is very small, almost too small to be true.
pred=princurve::principal_curve(XYZ,df=50)
lines3D(pred$s[,1],pred$s[,2],pred$s[,3],colvar = NULL,
phi = 20, theta = 60, r =sqrt(3), d =3, scale=FALSE,
col=2,lwd=4,as=1,
xlim=range(XYZ[,1]),ylim=range(XYZ[,2]),zlim=range(XYZ[,3]))
points3D(XYZ[,1],XYZ[,2],XYZ[,3],col=4,pch=19,cex=.6,add=TRUE)
What value of df do you prefer? When
looking at the plot of the principal curve with df=50 it is
clear to see that the curve is totally overfitting on the data we
generated and does not generalize well. Because of that, we prefer our
optimal df=6.
Why do you think that df=50 obtains such a good
value of leave-one-out cross-validation error? Increasing
the degrees of freedom makes the principal curve more flexible, i.e. a
more complex model, so it is able to capture noise in
the data. A more flexible model reduces the bias and can reach a very
low LOOCV error, yet it does not generalize well to new data. LOOCV
fails to detect the overfitting here because the principal curve with
50 degrees of freedom is very long and practically interpolates the
data points, so the curve becomes dense in the region of the data. This
makes the distance from any held-out point to the curve small, pushing
the LOOCV error towards zero.
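The "long and dense curve" argument can be checked numerically: the total arc length of the fitted curve grows sharply with df. A sketch, assuming XYZ and princurve as above (curve_length is a helper defined here, not part of princurve):

```r
# Total arc length of a fitted principal curve: sum of the lengths of
# the segments between consecutive curve points
curve_length <- function(fit) {
  s <- fit$s[fit$ord, ]            # curve points, in order along the curve
  sum(sqrt(rowSums(diff(s)^2)))    # sum of segment lengths
}
len6  <- curve_length(princurve::principal_curve(XYZ, df = 6))
len50 <- curve_length(princurve::principal_curve(XYZ, df = 50))
cat("arc length, df=6: ", len6, "\n")
cat("arc length, df=50:", len50, "\n")
# df=50 should give a much longer curve, consistent with the
# near-interpolating behaviour seen in the 3D plot
```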
Consider the ZIP number data set, from the book of Hastie et al. (2009). Read the training data set (in the file zip.train) and select only the ZEROs.
There are n=1194 digits corresponding to ZEROs in the
data set.
The following function plots a digit:
zip.train <- read.table("zip.train")
data=zip.train[zip.train$V1==0,][,-1]
print(dim(data)[1])
## [1] 1194
# plotting one digit
plot.zip <- function(x, use.first=FALSE, ...){
  x <- as.numeric(x)
  if (use.first){
    x.mat <- matrix(x, 16, 16)
  } else {
    x.mat <- matrix(x[-1], 16, 16)
  }
  image(1:16, 1:16, x.mat[,16:1],
        col=gray(seq(1, 0, l=12)), ...)
  invisible(
    if (!use.first) title(x[1])
  )
}
Here you have several examples of these ZERO digits:
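For instance (the chosen row indices are arbitrary), a few ZEROs can be displayed as follows, assuming data and plot.zip as defined above:

```r
# Show the first four ZERO digits in a 2x2 grid
op <- par(mfrow = c(2, 2))
for (i in 1:4) plot.zip(data[i, ], use.first = TRUE)
par(op)
```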
You must apply Local MDS to reduce the dimensionality of this dataset
using the function lmds from package stops.
You have to install the package stops (see the commented
install.packages call below, which uses the R-Forge repository) and
then attach the library:
#if (!require(stops, quietly=TRUE, warn.conflicts=FALSE)){
# install.packages("stops", repos="http://R-Forge.R-project.org",INSTALL_opts="--no-test-load")
#}
library(stops)
## Loading required package: smacof
## Loading required package: plotrix
## Loading required package: colorspace
## Loading required package: e1071
##
## Attaching package: 'smacof'
## The following object is masked from 'package:base':
##
## transform
## Loading required package: rgl
##
## Attaching package: 'rgl'
## The following object is masked from 'package:plotrix':
##
## mtext3d
##
## Attaching package: 'stops'
## The following object is masked from 'package:smacof':
##
## torgerson
## The following object is masked from 'package:stats':
##
## cmdscale
# help(lmds)
Obtain a 2-dimensional (q=2) configuration of the
data using parameters k=5 and tau=0.05 in the
lmds function. Do the scatterplot of the obtained
2-dimensional configuration.
x <- dist(as.matrix(data))
n <- dim(data)[1]  # x is a dist object, so dim(x)[1] would be NULL
k <- 5
tau <- .05
q <- 2
conf <- stats::cmdscale(x, k=q)
zero <- lmds(as.matrix(x), init=conf, ndim=q, k=k, tau=tau, itmax=1000)
plot(zero$conf, asp=1, main=paste0("Local MDS, k=",k,", tau=",tau))
text(zero$conf[,1], zero$conf[,2], 1:n, pos=3, cex=0.7)
Select some points in the scatterplot that cover its variability and use
the function plot.zip to plot the ZERO digits corresponding to the
selected points. The images you are plotting should allow you to give an
interpretation of the 2 coordinates obtained by Local MDS (observe how
the shape of ZEROs changes when moving along each direction of the
scatterplot). By doing this we have seen that the second coordinate from
Local MDS represents how thick the drawn line is, and the first
coordinate represents whether the circle is deformed more horizontally or
more vertically.
pos_names = c("left_top", "top", "right_top", "left", "middle", "right", "left_bot", "bot", "right_bot")
sel_points = c("638", "7290", "5727", "4808", "4465", "737" , "6795", "5697", "3572")
op <- par(mfrow = c(3,3))
for (i in 1:9){
plot.zip(main=pos_names[i],data[sel_points[i],],use.first = TRUE)
}
Another approach would be to detect 9 clusters in the data configuration obtained by Local MDS and use the centroids as points for visualizing the numbers. However, this approach does not reveal as clear a pattern as the one described above.
# select 9 points that cover the variability by looking for 9 clusters over the data and using their centroids
k=kmeans(zero$conf,9)
plot(zero$conf, col = k$cluster, main="K-Means Clustering on Local MDS")
points(k$centers)
library(FNN)
points=get.knnx(zero$conf, k$centers, 1)
points_x1=names(sort(zero$conf[points$nn.index,][,1]))
layout(matrix(k$cluster[points_x1], nrow = 3, ncol = 3))
for (p in points_x1){
plot.zip(main=k$cluster[p],data[p,],use.first = TRUE)
}
Select the optimal values of the parameters k and τ in Local MDS for ZERO
digits. Then describe graphically the low dimensional configuration
corresponding to the optimal parameters. Indication: As tentative values
for k use c(5,10,50), and for τ use
c(.1,.5,1).
LCMC <- function(D1, D2, Kp){
  D1 <- as.matrix(D1)
  D2 <- as.matrix(D2)
  n <- dim(D1)[1]
  N.Kp.i <- numeric(n)
  for (i in 1:n){
    N1.i <- sort.int(D1[i,], index.return=TRUE)$ix[1:Kp]
    N2.i <- sort.int(D2[i,], index.return=TRUE)$ix[1:Kp]
    N.Kp.i[i] <- length(intersect(N1.i, N2.i))
  }
  N.Kp <- mean(N.Kp.i)
  M.Kp.adj <- N.Kp/Kp - Kp/(n-1)
  return(list(N.Kp.i=N.Kp.i, M.Kp.adj=M.Kp.adj))
}
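As a quick sanity check of LCMC on synthetic data (not the ZIP digits): comparing a configuration with itself gives the maximum adjusted value 1 - Kp/(n-1), while two unrelated random configurations give a value near 0.

```r
set.seed(123)
A <- matrix(rnorm(200), ncol = 2)   # a random 2-d configuration (n = 100)
B <- matrix(rnorm(200), ncol = 2)   # an unrelated configuration
LCMC(dist(A), dist(A), Kp = 10)$M.Kp.adj  # self-comparison: 1 - 10/99
LCMC(dist(A), dist(B), Kp = 10)$M.Kp.adj  # unrelated: near 0
```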
K <- c(5,10,50)
tau <- c(.1,.5,1)
n <- dim(data)[1]  # x is a dist object, so dim(x)[1] would be NULL
q<-2
D1=x
Kp=10
LC.lmds <- matrix(0, nrow=length(K), ncol=length(tau))
lmds.conf <- array(vector("list",1), dim=dim(LC.lmds))  # renamed so it does not shadow lmds()
conf <- stats::cmdscale(x, k=q)
for (i in 1:length(K)){
  for (j in 1:length(tau)){
    lmds.conf[[i,j]] <- lmds(as.matrix(x), init=conf, ndim=q, k=K[i], tau=tau[j], itmax=1000)$conf
    D2 <- dist(lmds.conf[[i,j]])
    LC.lmds[i,j] <- LCMC(D1, D2, Kp)$M.Kp.adj
  }
}
ij.max <- arrayInd(which.max(LC.lmds), .dim=dim(LC.lmds))
k.max <- K[ij.max[1]]
tau.max <- tau[ij.max[2]]
lmds.max <- lmds.conf[[ij.max[1], ij.max[2]]]
# LCMC profiles at the optimum: vary k at the optimal tau, and vice versa
plot(K, LC.lmds[, ij.max[2]], type="b", main=paste0("k.max=", k.max, " (tau=", tau.max, ")"))
abline(v=k.max, col=2)
plot(tau, LC.lmds[ij.max[1], ], type="b", main=paste0("tau.max=", tau.max, " (k=", k.max, ")"))
abline(v=tau.max, col=2)
After using the LCMC criterion we can observe below the optimal configuration from the given pool of parameters. When comparing it with the scatterplot of the previous configuration, we see that the cloud of points is closer to a straight line: the points are more dispersed along the first dimension, while the range of the second dimension is roughly preserved. This means that the first and second dimensions show a higher correlation.
print(paste0("k.max=",k.max,"; tau.max=",tau.max))
## [1] "k.max=5; tau.max=1"
plot(lmds.max, asp=1, main=paste0("Local MDS, k=",k.max,", tau=",tau.max))
text(lmds.max[,1],lmds.max[,2],1:n,pos=3, cex=.7)
You must apply ISOMAP to reduce the dimensionality of this dataset
using the function isomap from package
vegan. Obtain a 2-dimensional configuration using k=5 and do the
scatterplot of the obtained configuration.
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
x=dist(as.matrix(data))
q<-2
k <- 5
ismp <- isomap(x,ndim=q, k=k)
plot(ismp,n.col=3,main="Output of ISOMAP Algorithm")
text(ismp$points[,1],ismp$points[,2],rownames(data),pos=3, cex=.5)
Select some points in the scatterplot that cover its variability and use
the function plot.zip to plot the ZERO digits corresponding to the
selected points. The images you are plotting should allow you to give
an interpretation of the 2 coordinates obtained by ISOMAP (observe how
the shape of ZEROs changes when moving along each direction of the
scatterplot). In the following figure we can appreciate the relations between the coordinates defined by ISOMAP and the types of zeroes drawn. The zeros with lower values in dimension 2 are those that have lines connecting them to the previous digit of the ZIP code, while the ones with high values in dimension 2 are rounder and less dense, with fewer painted pixels.
For the other dimension, the zeros with high values are rounder and larger, and the ones with low values are thin and tall.
It is also visible that the zeros with values around 0 in dimension 2 are drawn with thick lines.
pos_names = c("left_top", "top", "right_top", "left", "middle", "right", "left_bot", "bot", "right_bot")
sel_points = c("522", "5026", "6989", "7069", "4774", "364" , "6659", "1126", "1884")
op <- par(mfrow = c(3,3))
for (i in 1:9){
plot.zip(main=pos_names[i],data[sel_points[i],],use.first = TRUE)
}
Select the optimal value of the parameter k in ISOMAP for ZERO digits
(use q=2). Then describe graphically the low dimensional configuration
corresponding to the optimal parameter. Indication: As tentative values
for k use c(5,10,50).
library(vegan)
k_values <- c(5, 10, 50)
q=2
D1 <- dist(data)
LC.ismp <- numeric(length(k_values))
isomap_results <- vector("list",length(k_values))
for (i in 1:length(k_values)){
  k <- k_values[i]
  isomap_results[[i]] <- isomap(D1, ndim=q, k=k)
  dist_k <- dist(isomap_results[[i]]$points[, 1:q])
  # use a fixed neighborhood size Kp=10 in LCMC (as in the Local MDS
  # section) so values are comparable across k and across methods
  LC.ismp[i] <- LCMC(D1, dist_k, 10)$M.Kp.adj
}
i_max <- which.max(LC.ismp)
k_max <- k_values[i_max]
isomap_max <- isomap_results[[i_max]]
plot(k_values, LC.ismp, type="b", main=paste0("K=",round(k_max,4)))
abline(v=k_max,col=2)
best_k_isomap <- isomap(D1, ndim=q, k=k_max)
plot(best_k_isomap$points[,1], best_k_isomap$points[,2], pch=20, main="Best ISOMAP output in 2-dim", asp=1)
text(best_k_isomap$points[,1], best_k_isomap$points[,2], rownames(data),pos=3, cex=.5)
To explain this low dimensional representation we will use the same
method as before. Through this method we have seen that the vertical
dimension is correlated with the density of the image, how black the
zero is. The first dimension (the horizontal one) seems correlated with
the roundness of the zero, being the round zeros the ones with larger
values in this dimension.
pos_names = c("left_top", "top", "right_top", "left", "middle", "right", "left_bot", "bot", "right_bot")
sel_points = c("2695", "5542", "5911", "6381", "5626", "4964", "6462", "6890", "4898")
op <- par(mfrow = c(3,3))
for (i in 1:9){
plot.zip(main=pos_names[i],data[sel_points[i],],use.first = TRUE)
}
You must apply t-SNE to reduce the dimensionality of this dataset
using the function Rtsne from package Rtsne.
library(Rtsne)
# help(Rtsne)
Obtain a 2-dimensional configuration of the data using
perplexity=40 and theta=0 in the
Rtsne function. Do the scatterplot of the obtained
2-dimensional configuration.
x <- dist(as.matrix(data))
set.seed(42)
theta= 0.0
perp = 40
tsne_out <- Rtsne(x, pca=FALSE, perplexity=perp, theta=theta)
plot(tsne_out$Y, asp=1, main="t-SNE on ZIP data")
text(tsne_out$Y[,1], tsne_out$Y[,2], rownames(data), pos = 3, cex = .7)
Select some points in the scatterplot that cover its variability and use
the function plot.zip to plot the ZERO digits corresponding to the
selected points. The images you are plotting should allow you to give
an interpretation of the 2 coordinates obtained by t-SNE (observe how
the shape of ZEROs changes when moving along each direction of the
scatterplot). Proceeding as before, we see that the zeros at the bottom are the ones deviating from the usual circle shape: they have additional segments, where the line that defines the zero is longer than it should be and has an external ramification. At the top of the plot, on the other hand, are the zeros whose circle is incomplete, where the line is too short. We could say that the other dimension is related to the vertical or horizontal deformation of the zero, but this is not very clear.
pos_names = c("left_top", "top", "right_top", "left", "middle", "right", "left_bot", "bot", "right_bot")
sel_points = c("5331", "2001", "4391", "5819", "5333", "4272", "5532", "4614", "6890")
op <- par(mfrow = c(3,3))
for (i in 1:9){
plot.zip(main=pos_names[i],data[sel_points[i],],use.first = TRUE)
}
Select the optimal value of the parameter
perplexity in t-SNE for ZERO digits (use q=2 and
theta=0). Then describe graphically the low dimensional
configuration corresponding to the optimal parameter. Indication: As
tentative values for perplexity use
c(10,20,40).
require(Rtsne)
set.seed(4444)
D1 <- dist(data)
Kp <- 10
q <- 2
theta = 0
perplexity <- c(10,20,40)
LC.tsne <- numeric(length(perplexity))
Rtsne.k <- vector("list",length(perplexity))
for (i in 1:length(perplexity)){
Rtsne.k[[i]] <- Rtsne(D1, perplexity=perplexity[i], dims=q,
theta=0, pca=FALSE, max_iter = 1000)
D2.k <- dist(Rtsne.k[[i]]$Y)
LC.tsne[i] <- LCMC(D1,D2.k,Kp)$M.Kp.adj
}
i.max <- which.max(LC.tsne)
perplexity_max <- perplexity[i.max[1]]
best_rtsne <- Rtsne.k[[i.max]]
plot(perplexity,LC.tsne, main=paste0("perplexity max=",perplexity_max))
abline(v=perplexity[i.max],col=2)
plot(best_rtsne$Y, pch=20, asp=1, main=paste0("Best t-SNE output, perplexity=", perplexity_max))
text(best_rtsne$Y[,1], best_rtsne$Y[,2], rownames(data), pos=3, cex=.5)
Compare the three optimal 2-dimensional configurations using the function pairs applied to a
6-dimensional matrix. With the following figure we see that the first dimensions of the three dimensionality reduction methods we have used are very similar. The scatterplots between first dimensions of different methods show a clear correlation, close to the identity (or minus the identity) with some noise. The same effect is also present between the second dimensions of Local MDS and ISOMAP.
We can say that Local MDS and ISOMAP give very similar new dimensions, up to a sign.
For t-SNE the resemblance with the other two second dimensions is less clear. There is a visible correlation, but the noise is much higher, since the scatterplots are not as close to a straight identity line.
pairs(cbind(lmds1=lmds.max[,1], lmds2=lmds.max[,2],ismp1=best_k_isomap$points[,1],ismp2=best_k_isomap$points[,2],tsne1=best_rtsne$Y[,1],tsne2=best_rtsne$Y[,2]))
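The "similar up to a sign" observation can also be quantified with the correlation matrix of the six coordinates (assuming lmds.max, best_k_isomap and best_rtsne from the fits above):

```r
configs <- cbind(lmds1 = lmds.max[, 1],             lmds2 = lmds.max[, 2],
                 ismp1 = best_k_isomap$points[, 1], ismp2 = best_k_isomap$points[, 2],
                 tsne1 = best_rtsne$Y[, 1],         tsne2 = best_rtsne$Y[, 2])
round(cor(configs), 2)
# entries near +1 or -1 indicate agreement up to a sign change;
# smaller entries involving tsne2 reflect the noisier relation noted above
```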
The method that has produced the largest value for LCMC is, by far, t-SNE.
cat("local MDS: ",max(LC.lmds), "\n")
## local MDS: 0.2628071
cat("ISOMAP: ",max(LC.ismp), "\n")
## ISOMAP: 0.3574188
cat("t-SNE: ",max(LC.tsne),"\n")
## t-SNE: 0.5061906